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Abstract. This paper describes the parallelization strategy and achieved parallel effi- 
ciency of an explicit time-marching algorithm for solving conservation laws. The Space- 
Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 
2D and 3D Euler equations is parallelized with the aid of domain decomposition. The 
parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel 
computer is checked. 


1 Introduction 

To accurately resolve complex three-dimensional turbulent flows, with possible 
acoustic waves or combustion phenomena, a time-dependent Euler or N-S solver 
must be used in Large Eddy Simulation or Direct Numerical Simulation mode. 
The need to resolve multiple space and time scales leads to very large computer 
memory and operation count requirements. The use of such computational re- 
sources is economically feasible only on modern parallel computing architectures. 
This paper presents the strategy and specific tools that were used to parallelize 
an existing serial code for solving the 3D Euler equations. While the concepts 
involved have been explored by other researchers over many years, the present 
authors hope that the description of a recent simple parallelization effort will be 
of value to many who are currently switching from serial supercomputers to more 
economical parallel architectures but who are not experts in parallel computing. 

2 The CE/SE Euler Scheme 

The Space-Time Conservation Element and Solution Element (CE/SE) method 
advanced by Chang [1] offers a simple numerical scheme that accurately resolves 
both shock waves and acoustic waves in the nearfield in a unified manner. The 
CE/SE 2D Euler solver is described in [2], while the CE/SE 3D Euler solver was 
presented in [3]. See the web site http://www.lerc.nasa.gov/WWW/microbus 
for details of the CE/SE method and a complete list of publications, including 
those detailing application of CE/SE schemes to computational aeroacoustics 
and combustion. 

The success of the CE/SE technique for aeroacoustics, combustion, and other 
computations can be traced to the distinguishing features of the method, which 
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are (i) an emphasis on solving a general space-time integral form of the conser- 
vation laws, (ii) the requirement of local and global conservation of space-time 
fluxes as a basis for the method, (iii) use of only the simplest discretization 
stencils on a staggered space-time mesh, and (iv) the treatment of flow property 
gradients as additional unknowns. The basic inviscid flow solvers are nondissipa- 
tive (no inherent artificial viscosity) and neutrally stable. A specifiable amount 
of artificial viscosity is explicitly added to provide stabilizing damping and to 
minimize numerical oscillations in the vicinity of shock waves. Thus the CE/SE 
Method is well suited to aeroacoustics and combustion applications. 

3 Code Parallelization via Domain Decomposition 

The CE/SE Euler scheme is an explicit time-marching scheme, making it well- 
suited for parallelization via domain decomposition. If the solution is to be 
computed at a time level n, the computational stencil involves a single cell at the 
time level n, and involves only the cell and its immediate neighbors at the time 
level n — 1. The algorithm to compute the solution is the same for every interior 
cell. Therefore, the scheme possesses a data parallelism. The only dependency in 
this data parallelism is that the solution and geometric information pertaining to 
immediate neighbors at the time level n — 1 needs to be known by any cell before 
the solution at that cell can be computed at the level n. The natural technique 
for parallelization is by dividing the cells among the available processors. Given 
the low speed of communication between processors relative to the speed with 
which a processor accesses its local memory, the division should be such that 
communication between processors is minimized. 

To enable efficient parallelization of the CE/SE codes, a suitable domain 
decomposition strategy for the numerical flow problem has been selected. This 
involves use of one of the METIS codes based on algorithms described in [4]. 
Specifically, the dual- mesh partitioning code named partdmesh is used. It is 
available at no cost, and can be downloaded from the website http://www- 
users.cs.umn.edu/~karypis/metis/main.shtrnl. Given an unstructured mesh as a 
map of cell numbers to node (cell vertex) numbers, the METIS mesh-partitioning 
code produces a A>way partition of the domain, where the input k is the desired 
number of subdomains. Generally, k is set equal to the number of computer 
processors used for the flow computation. The METIS code can partition both 
2D meshes of triangular cells and 3D meshes of tetrahedral cells. The parti- 
tioning algorithm attempts to produce an optimal partition in three respects. 
The algorithm attempts to balance the load, i.e., to assign an equal number of 
cells to each subdomain. The algorithm also attempts to minimize the number of 
shared cell faces at the inter-subdomain boundaries, so as to lessen the amount 
of inter-processor communication required. Thirdly, the algorithm also attempts 
to minimize the maximum of the number of subdomains adjacent to any subdo- 
main. The METIS code is based on a multilevel graph-partitioning algorithm, 
and is very rapid, producing a 32-way partition of a 96800-cell domain in about 
2 seconds on a personal computer with a 450 MHz Pentium-II processor. In 
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the current work, only the serial computer version of the METIS code is used, 
although a parallel computer version is also available. Other mesh-partitioning 
algorithms are available, but have not been evaluated for this work. In our ex- 
perience, the METIS code produces partitions that are close to optimal, at very 
little computational cost. 

The effort reported here consisted of “parallelizing” already existing 2D and 
3D research codes that were written to run in serial fashion on a single processor. 
The parallelization was done in such a way as to introduce minimal modification 
of the existing flow solvers. The chief aim was to reduce the wallclock time 
of any given computation by using multiple processors. With regard to the 
actual flow computation, an additional benefit of the domain decomposition 
strategy is that the overall computer memory requirement is divided among the 
processors. However, little attention was paid to dividing among the processors 
the overall memory requirement of the domain decomposition and subdomain 
setup portions of the procedure. To enable the codes to be run on both parallel 
computers and workstation clusters, a message-passing approach was taken. 

The CE/SE scheme is formulated on an unstructured mesh of triangles in the 
planar or axisymmetric case, and on an unstructured mesh of tetrahedra in the 
three-dimensional case. For simplicity of presentation, the case of a planar mesh 
of triangles will be discussed, though the parallelization procedure for volume 
meshes of tetrahedra is similar. The output typical of an unstructured mesh 
generator consists of the number of nodes, V, the number of cells, C , the number 
of boundary cell faces, B , the spatial coordinates X of each node, the map C2V 
of cell number to the node numbers of its vertices (i.e., the cell definitions), a 
map B2T of the boundary face number to a label denoting the type of boundary 
condition, the map B2V of the boundary face number to the node numbers of 
its vertices, and the map B2C of the boundary face number to the cell number 
of the cell to which the boundary face belongs. If the cell c has vertices with 
node numbers vi, Vn and t> 3 , then C2V(c 7 i) = Vi, i — 1,2,3. The map C2V is 
supplied as input to the METIS mesh partitioning code together with the total 
number P of processors or subdomains desired. The output from the METIS 
code is a map C2P of cell number to processor number, where the processor 
number to which the cell is assigned lies in the range 0 through P — 1. 

A simple example of a planar mesh is considered in Fig. 1. The mesh happens 
to be structured. However, structured meshes are treated in exactly the same 
way as a general unstructured mesh. The mesh of triangles covers the domain 
ABCDEFGHA. The total number of cells here is C = 8. The nodes are 
labeled A through I in the figure, to avoid confusion with the cell numbers. 
Suppose that the mesh is partitioned into two subdomains, with cells 1 through 
4 being assigned to Processor 0, and cells 5 through 8 to Processor 1. The current 
practice in the CE/SE schemes is to implement boundary conditions by adjoining 
“ghost” cells of appropriate geometry outside the domain of the flow problem. 
These ghost cells, one at each boundary face, are numbered —1 through —B, see 
Fig. I. A ghost cell at any boundary face that does not call for a periodicity 
type of boundary condition is assigned to the same processor as the interior cell 
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with which it shares the boundary face. A ghost cell at a periodic boundary 
face is given the same geometric shape and assigned the same processor as the 
interior cell at the periodic image of the boundary face. For example, in Fig. 1, 
if the faces AB and GF are locations of periodicity of the solution, and if cell 2 
is assigned to processor 0 while cell 5 is assigned to processor 1, then cell —6 is 
the periodic image of cell 2 and is assigned to processor 0, while cell -1 is the 
periodic image of cell 5 and is assigned to processor 1. 

A face numbering is next generated, and the face-to-cell map is calculated. 
For very large meshes, this process may require too much memory to be done 
on a single processor, and a less memory- intensive approach would be needed 
to generate equivalent information. The cells marked for each processor p are 
given cell numbers “local” to the processor. Thus, the local numbering of the 
interior cells of processor p runs from 1 through C p , and that of the ghost cells of 
processor p runs from —1 through —B p . Local node numbering of the relevant 
nodes is also performed. The subdomain setup code then scans the face-to-cell 
map to find faces that are shared by processor p with a different processor. If 
there are D p such faces, the cell at each face that is “foreign” to the processor 
p is assigned a local cell number in the range C v + 1 through C p + D p , in the 
order in which the face is encountered in the face-to-cell map. The number of 
neighboring processors of any processor p is counted during this scan, and for 
each neighboring processor, the cell numbers (local to processor p) of both the 
“native” cells and the “foreign” cells are stored in the order in which the face is 
encountered in the face-to-cell map. An important property of this algorithm is 
that faces shared by two given processors p and p f are encountered in the same 
order when considering either p or p'. During the identification and local num- 
bering of cells relevant to any processor p, maps from the local to global cell and 
node numberings are constructed, together with the inverse maps. Finally, for 
each processor p, mesh information is translated from global to local numbering 
and written to a disk file (with filename keyed to the processor number) about 
all the native interior and ghost cells, and about all the “neighboring” cells at 
the shared border of the subdomain. The “neighboring” cells appear merely as 
passive cells on processor p. They are used to store the flow data transferred 
from the neighboring processors. For each processor, the list of neighboring pro- 
cessors and the lists of native and foreign cells at the border with each processor 
are also written to disk files. 

For a A:- way partition of the domain, a CE/SE program executable is loaded 
into the node memory of each of k processors. Each of these processes then 
computes the solution in the corresponding subdomain. Only the cells assigned 
to each processor, together with passive border cells belonging to neighboring 
processors are stored, and the numbering is local. After each time step, the 
solution at the passive border cells is updated. This is done by message passing 
between processors, using Message Passing Interface (MPI) library calls. See 
[5] and the web site http://vmw-unix.mcs.anl.gov/mpi for details of the MPI 
standard. If necessary, the solution on the complete domain can be stitched 
together from the solutions on the individual subdomains by postprocessing. 
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4 Achieved Parallel Speedup 


The parallel efficiency of the 3D Euler code when running on an SGI Origin 
2000 computer architecture has been measured. The Origin 2000 uses a hyper- 
cube network topology to maximize the communication bandwidth. The parallel 
speedup achieved on this architecture is shown in Figs. 2 and 3. The test involved 
computing 100 time steps on a mesh of 172800 cells, using 1, 2, 4, 8, 16, 32 and 
64 processors. The test problem was the reflection of an oblique shock wave 
from a wall, previously computed using the 2D Euler code in [2] , but computed 
here using the 3D code. The parallel speedup of a run is defined as the inverse 
ratio of the wallclock time to the wallclock time of the single- processor case. 
The MPLWTIME function call was used to reliably recover the wallclock time 
for each processor. The wallclock time of a run was taken to be the maximum 
time among the processors. The parallel speedup is seen to deviate visibly from 
ideal in the 32 processor case, indicating that for this implementation on this 
computer architecture, about 5400 mesh cells per processor is the limit below 
which the higher ratio of communication to computation will give poor parallel 
efficiency. Using 64 processors for this mesh size (about 2700 cells per processor) 
leads to a parallel speedup of only about 48. The same flow problem was run 
on a mesh four times as large, i.e. 691200 cells, with 64, 128 and 256 processors. 
This mesh was too large to be run on a single processor. Therefore, the 64 
processor case was treated as having ideal speedup of 64, in order to examine 
the rate of falloff of the speedup, plotted in Fig. 3. Comparison of Figs. 2 and 
3 shows that the rate of falloff (i.e., the slope) is approximately the same at 64 
processors in Fig. 2 as it is at 256 processors in Fig. 3. 

5 Summary and Conclusions 

The CE/SE 2D and 3D Euler schemes were extended from serial computer ver- 
sions to parallel computer versions. The parallel speedup of the 3D code was 
significantly less than ideal when an increased number of processors brought the 
number of mesh cells per processor below 5400. This indicates that tuning of 
the interprocessor communication will be beneficial when using a large number 
of processors on moderate sized meshes. 
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